Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation